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The multiple sequence alignment (MSA) of a protein family provides a wealth of information in 
terms of the conservation pattern of amino acid residues not only at each alignment site but also 
between distant sites. In order to statistically model the MSA incorporating both short-range and 
long-range correlations as well as insertions, I have derived a lattice gas model of the MSA based 
on the principle of maximum entropy. The partition function, obtained by the transfer matrix 
method with a mean-field approximation, accounts for all possible alignments with all possible 
sequences. The model parameters for short-range and long-range interactions were determined by 
a self-consistent condition and by a Gaussian approximation, respectively. Using this model with 
and without long-range interactions, I analyzed the globin and V-set domains by increasing the 
“temperature” and by “mutating” a site. The correlations between residue conservation and various 
measures of the system’s stability indicate that the long-range interactions make the conservation 
pattern more specific to the structure, and increasingly stabilize better conserved residues. 


I. INTRODUCTION 

A multiple sequence alignment (MSA) of a family of 
proteins provides us with valuable information to char¬ 
acterize the protein family in terms of patterns of amino 
acid residues at alignment sites |T] . The usefulness of ana¬ 
lyzing the residue compositions in the MSA has led to the 
development of a class of sequence profile methods m 
such as PSI-BLAST [J| and profile hidden Markov mod¬ 
els (HMM) (5j, which can be used to detect distantly 
related proteins, to obtain high-quality alignments, and 
to improve structure prediction [Bj as well as to charac¬ 
terize functional and structural roles of the conservation 
pattern [7:. In the sequence profile methods, it is as¬ 
sumed that the residue composition of each site is inde¬ 
pendent of other sites. With this crude assumption, the 
conservation of residues are explained in terms of their 
functional and structural roles. However, to further un¬ 
derstand the mechanism of these roles in the context of 
protein sequences, one needs to drop the assumption of 
site independence. In fact, there seems to be no way for 
a residue to “know” that it is in a particular position in 
the sequence to play a particular functional or structural 
role other than by its interactions with other residues in 
the sequence (or with other molecules in the biological 
system). Therefore, to understand what makes particu¬ 
lar residues important at each site, one needs to study 
the correlations between different sites. 

Correlations between distant sites in a MSA can be 
quantified by identifying correlated substitutions. They 
have been exploited to gain further insights of struc¬ 
tures and functions of proteins EHH)|. However, the 
apparent correlations observed in a MSA are only a 
result of intricate interactions between residues as in 
the underlying native structures of proteins. Recently, 


* Electronic address: akinjo@protein.osaka-u.ac.jp 


there have been a number of successful attempts to 
extract direct correlations El ESI which are in fact 
found to be in excellent agreement with the residue- 
residue contacts in native structures EHEI to the ex¬ 
tent that the three-dimensional structures can be actu¬ 
ally (re)constructed |TU ITS] . 

One drawback of the direct-coupling analysis (as well 
as other direct correlation methods) is that it takes into 
account only those alignment sites that are well aligned 
(the “core” sites), and ignores insertions. The primary 
difficulty in the treatment of insertion is that they are 
of variable lengths, which makes the system size vari¬ 
able and hence greatly complicates the problem. When 
one is interested in some universal properties of a pro¬ 
tein family such as their approximate three-dimensional 
fold, insertions may be irrelevant. However, when one 
is interested in a particular member of the family, the 
existence of some insertions may be important. In fact, 
insertions, which may be regarded as “embellishments” 
to a conserved structural core, are deemed to be an ef¬ 
fective strategy for proteins to diversify and specialize 
their functions m■ Some insertions are also known 
to play critical roles in protein oligomerization mm- 
Of more fundamental concern is that ignoring insertions 
in a MSA means ignoring the polypeptide chain struc¬ 
ture, which implies theoretical as well as practical conse¬ 
quences. Theoretically, it is questionable to ignore such 
a strong interaction as the peptide bond in order to ac¬ 
curately describe the sequence and structure of proteins. 
Practically, in order to identify new members of a family 
by aligning their sequences to some MSA-derived model 
incorporating direct correlations, a consistent treatment 
of polypeptide sequences is necessary. 

In this paper, I present a new statistical model of the 
MSA that incorporates both direct correlations and in¬ 
sertions. The main objective of this model is incorpo¬ 
ration of long-range correlations into multiple-sequence 
alignment, rather than improving contact prediction by 
incorporating insertions. As will be apparent from the 
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FIG. 1: The lattice structure of the model. The squares 
marked with Oi(i = 0, • • • , N + 1) correspond to core 
(matching/deletion) sites, the diamonds marked with 7;(i = 
0, • ■ ■, N) correspond to insert sites. The edges between sites 
indicates bonded interactions. See Figure [2] for concrete ex¬ 
amples. 

formulation, this model is a generalization of the direct- 
coupling analysis that is based on the principle of maxi¬ 
mum entropy Hi. This model can be regarded as a 
finite, quasi-one-dimensional, multicomponent, and het¬ 
erogeneous lattice gas model where the “particles” are 
amino acid residues. In the following, the “lattice gas 
model” refers to this model. The lattice system con¬ 
sists of two kinds of lattice sites, corresponding to the 
core (matching or deletion) or the insert, that are con¬ 
nected in a similar, but distinctively different, manner as 
in the profile HMM model. While long-range interactions 
are treated by using a mean-field approximation, short- 
range interactions are treated rigorously so that the parti¬ 
tion function is obtained analytically by a transfer matrix 
method. One notable feature of this model is that its par¬ 
tition function literally accounts for all the possible align¬ 
ments with all the possible protein sequences, including 
infinitely long ones. Based on this model, various virtual 
experiments can be performed by changing the “tempera¬ 
ture” of the system or by manipulating the “chemical po¬ 
tentials” associated with the particles (residues) at each 
site. 

The paper is organized as follows. In Section |nj some 
basic quantities are defined and the lattice gas model of 
the multiple sequence alignment is formulated. Section 
IH] provides the details of numerical methods and data 
preparation. Section |TV| gives the results of virtual exper¬ 
iments by increasing the temperature or by introducing 
alanine point mutants. In Section [Vj limitations, impli¬ 
cations as well as possible extensions of the present model 
are discussed. 


II. THEORY 

A. Representing multiple sequence alignment as 
lattice gas system 

A MSA may be regarded as a matrix of symbols in 
which each row is a protein sequence possibly with gaps 
and each column is an alignment site. Some columns 
may contain few gaps so the residues in such positions 
may be relatively important for the protein family. Here, 


I informally define a “core” (matching/deletion) site as 
an alignment site which are relatively well aligned. The 
remaining sites are defined to be insert sites. Core sites 
are ordered from the N-terminal to the C-terminal, and 
denoted as Oi, O 2 , ■ • ■, On with N being the number of 
core sites. For convenience, the terminal core sites O 0 
and On +1 are appended to indicate the start and end 
of the alignment, as in the profile HMM [Tj. To each 
core site, either one of 20 amino acid residues or a gap 
(deletion) may be assigned, and the latter is treated as 
the 21-st type of residue. An insert site between two 
core sites Oi and Oi+i is denoted as I t . All the gap 
symbols ignored at an insert site. In the following, the 
(ordered) sets of core and insert sites are denoted as O = 
{Oo • • •, Om+ 1 } and X = {/ 0 , • • •, In}, respectively, and 
their union as S = O U X. In addition, let us define 
a set of amino acid residues allowed for an insert site 
l-i as Ar. ; = {A, - • -, Y} (20 amino acid residue types), 
and that for a core site Oi as Aoi = {A, • • -, Y, —} (20 
amino acid residues and deletion) for i = 1, • • •, N and 
Ao a = Ao n+ 1 = {—} (deletion only) for the terminal 
sites. 

For one protein sequence in the MSA, at most one 
residue may correspond to each core site Oi whereas any 
number of residues may correspond to an insert site 
In this sense, residues behave like fermions on core sites 
and like bosons on insert sites. The set of core and insert 
sites comprise a quasi-one-dimensional lattice structure 
as shown in Figure [T] In this lattice structure, two sites 
are connected if two consecutive residues in a protein se¬ 
quence (possibly including gap symbols) can be assigned. 
If two sites are directly connected, they are defined to be 
a bonded or short-range pair. The self-connecting loop 
in each insert site indicates that it makes a bonded pair 
with itself. Thus, an insertion may be indefinitely long, 
manifesting its boson-like character. 

Based on this lattice system, an alignment X of a 
particular protein sequence a = 0102 • • • at in the MSA 
may be represented as a sequence of length Ax consist¬ 
ing of ordered pairs of a lattice site and a residue of a: 
X = A'oXi • • • At x Xt x +i (“matchings” to the terminal 
sites are also included). Here, each Xk = (S,a) with 
S £ S and a £ As- A whole MSA consisting of M se¬ 
quences is a set of such aligned sequences: {X t } t= i j ... i M- 
Figure [2] shows some concrete examples of this represen¬ 
tation of alignment. 

B. Variables to characterize alignments 

Using the above representation, let us define some 
quantities that characterize an alignment in a given MSA. 
For a given lattice model and its alignment X with a pro¬ 
tein sequence, the number of the residue type a £ As at 
the lattice site S £ S is defined as 

n s(a |X) = y ^S(S,a),X k - 

k =0 


( 1 ) 
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FIG. 2: Example of a multiple sequence alignment (based 
on Hi)- Each row corresponds to a protein sequence (SI,- • 
S7) and each column to an alignment site. Below the hori¬ 
zontal line, each alignment site is annotated as to whether it 
corresponds to a core (matching or deletion) site (“O”) or an 
insert site (“I”). Indicated below these “O”/ “I” symbols are 
the position of lattice sites, (c.f. Figure [I]) The size of the 
lattice model based on this MSA is N = 8 . Insert sites other 
than Jo, I3 and Is are not explicit in this MSA. For example, 
the alignment of the sequence S2 in this figure is represented 
as X s2 = A 0 ■ • • A ' 9 = (O 0 , )(Or, V) (0 2 ,-) (O 3 ,-) (0 4 ,N) 

(Os, V) (06. D) (O 7 , E) (Os, V) (O 9 , —) where the first and last 
pairs represent the start and end of the alignment, respec¬ 
tively. As another example, the alignment of sequence S7 
is X s7 = Xo ■ ■ ■ Xu = (Oo,-) (Oi,I) (O a ,A) (0 3 ,G) (I3, A) 
(J 3 ,D) (o 4 ,n) (Os,G) (Oe, a) (O 7 , G) (O8,V) (o 9 ,-). 


This quantity is referred to as the single-site count. Sim¬ 
ilarly, the number of a pair of residue types a £ As and 
b £ As' on a bonded pair of lattice sites S and S' occu¬ 
pied by two consecutive alignment sites is defined as 

T/x 

ng S ,(a, 6|X) = E%,a),* fc %',6),x fc+1 , (2) 

k -0 

which is referred to as the bonded pair count. The single¬ 
site counts and bonded pair counts are the two fundamen¬ 
tal stochastic variables in the present theory. For later 
convenience, let us define the non-bonded pair counts as 

n^,(a,6|X) = n s (a|X)n S /(6|X) (3) 

for S, S' £ S. Note that the non-bonded pair counts 
may be defined for residues residing on neighboring lat¬ 
tice sites as well as on the same (S = S') site. The terms 
“bonded” and “non-bonded” here are meant to describe 
the connectivity along the polypeptide sequence rather 
than that along the lattice system (A pair of residues in 
neighboring lattice sites may be either bonded or non- 
bonded depending on the given alignment). From these 
definitions, several relations follow. First, by the fermion- 


like character of the core site, we have for each Oi £ O 

E no, (a|X) = 1. (4) 

aEAoi 


Between bonded pair counts and single-site count, we 
have 


E <o i+1 M|x)+ E 4,/,(MI x ) 

E < iS ,(M|x)+ E < S 'M|X) 

a-CAoi a^Ai i 


ns(a|X), 

(5) 

n S '(b\X) 

( 6 ) 


where S = Oi,I{ and S' = Oi+i,Ii. Lastly, between 
non-bonded pair counts and single-site count, we have 


E 

nf t0 .( a , b |X) 

= n s {a\X), 

(7) 

b&A 0j 




E 

nSE'MlX) 

= n S'(b\X) 

(8) 

a^iAoi 





where S, S' £ S. 


C. Probability distribution of alignments 


I would like to statistically characterize the given MSA 
in terms of the above quantities. To do so, suppose that 


the probability P(X) of an alignment X is known for 
the lattice model. Then, the expectation values of these 
numbers are defined as follows: 

n s (a ) = E P ( X ) n s(a|X), 

x 

(9) 

n b ss'M = E P (X)n^(a, b\X) 
x 

, (10) 

nfAa,b) = E P (X)n^(a,6|X) 

(11) 


x 


which are referred to as single-site (number) densities, 
bonded pair (number) densities, and non-bonded pair 
(number) densities, respectively. These number densities 
naturally satisfy the relations analogous to Eqs. Q-Q. 

To determine the form of P(X), the principle of max¬ 
imum entropy is employed with the constraints that the 
densities are equal to those observed in the given MSA. 
The entropy is given as 

s = -E p ( x ) lnp ( x )- ( 12 ) 

X 


Let us denote the densities estimated from the give n M SA 
as ng(a), n b ss ,(a,b), and fig b s ,(a,b) (see Section III for 


the method to obtain these quantities). The following 
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Lagrangian, consisting of the entropy (Eq. [l2]) and the Let us define the mean field Ks(a) acting on the residue 
constraints for the densities, is maximized: type a on the site S: 


£ = -T^P(X) lnP(X) 
x 

b-P- _^ 

+ E Jss ' t n SS' K b) - n b ss , (a, b )] 

(S,S') a,b 

+^EE Kss' (a, b) [rig b s , (a, b) — rig b s , (a, b )] 

S,S' a,b 

+ '52f J -s{a)[n s (a) - n s {a)] (13) 

S,a 

where jUg(a), Jss'(a, 6) and A"ss'(a,6) are undetermined 
multipliers, and the summation J2(s s') over bonded 
pairs. We have also introduced the “temperature” pa¬ 
rameter T. Solving 5£/5P(X.) = 0 leads to the Boltz¬ 
mann distribution: 

P(X) = e *Pl- E i X >/ T ] , (14) 

where 5 is the normalization constant or the partition 
function defined by 


S = ]Texp[-A(X)/T], (15) 

x 

and A(X) is the “energy” of the system given as 

b.p. 

A(X) = -£ Y / J ss'(a,b)n b SS '(a, 6|X) 

(5,5') a,b 

-^EE K ss >(a,b)ns(a\X.)ns>{b\X) 

Z S,S' a,b 

-^2^s(a)n s {a\X). (16) 

S,a 

From this expression of the energy function, we can in¬ 
terpret /tg(a) as the chemical potential imposed on the 
particle (amino acid residue) a at the site S, and J and AT 
as bonded and non-bonded coupling parameters, respec¬ 
tively. The problem of obtaining the probability distri¬ 
bution -P(X) is thus reduced to computing the partition 
function 5. In the following, the non-bonded interactions 
are considered only between core sites (i.e., core-insert 
and insert-insert pairs are discarded) for a technical rea¬ 
son (see the subsection “Determining the K matrix” be¬ 
low). 


D. Partition function 

In this subsection, I assume that the parameters /q J 
and K are fixed. To treat the long-range interactions, a 
mean-field approximation is applied. Then, the partition 
function can be computed by a transfer matrix method. 


Ks{a) = ^ K S s’ ( a , b)[ns>(b) - n s < (5)] (17) 


S’,b 


where ns'(b) is subtracted for convenience, but this does 
not essentially change the system’s behavior (it simply 
shifts the chemical potential /is (a) which can be com¬ 
pensated by J; see Eq. 18 and Section IIG). Next, let 


us define the transfer matrices between a bonded pair of 
sites S = Oi,Ii and S' = Oi + i,Ii as 

Tss'(a,b) = exp[{J S s>(a,b) + fj, s >(b) + K s >(b)}/T). 

(18) 


To alleviate the expressions for the partial partition 
functions, a bracket notation is introduced. First, define 
a set of standard basis vectors: (a| and |a) corresponding 
to each residue type a on each site. These vectors satisfy 
the following orthonormal properties: 

(a\b) = 5 a , b, (19) 

|a) (a| = I|^ s | (identity matrix) (20) 

a£As 

where I|^ s | is the |„4,s|-dimensional identity matrix. For 
each site i. I define the partial partition functions (Oi\ 
and (Ii\ that count the statistical weight of all possible 
alignments starting from the start site Oq and terminat¬ 
ing at Oi and £, respectively. Similarly, partial partition 
functions | 0*} and | /,) account for all possible alignments 
“starting” from the end site Ojv+i and “terminating” at 
Ot and £. Any (complete) alignment starts at the start 
site O 0 and ends at the end site Ojy+ 1 , and these sites 
are formally treated as “deletion (-).” Therefore, the 
boundary conditions are given as 

(Ool = H = (0, • ■ •, 0,1), (21) 

|Ojv+i) = |-> = (0,---,0,1)*. (22) 

Based on this setting, the recursion formulae for partial 
partition functions are given as 

<Oi+l| = (Oi\T 0i o i+1 + (Ii\T Ii0i+1 , (23) 

(Ii I = {OilTotU + {Ii\T Iih (24) 

in the forward (N- to C-terminal) direction, and 

I Oi) = T 0i0i+1 \O i+ i) + T 0ih | h) , (25) 

| U) = T Ii0i+1 \O i+1 ) +T IiIi \I i ) (26) 

in the backward (C- to N-terminal) direction. Here, each 
transfer matrix Tgs 1 is viewed as a |As| x |As'| matrix 
with (a\Tss'\b) = Tss'{a,b). By expanding Eq. ( [24] ) , we 
have 


(Ii\ = (Oil T 0i ii (l + T uii + Tfiit H-) (27) 

= (OilTo./^I-T^)- 1 (28) 
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where I = I 20 (the 20-dimensional identity matrix). Sim¬ 
ilarly, we have 

\h) = (I — Ti iIi )~ 1 T Ii0i+1 \Oi +\). (29) 


Thus, (Ii | and |/j) indeed include contributions from in¬ 
finitely long insertions. The inverse matrix (I — Tj.j . ) _1 
exists if the spectral radius of Tj.j. is less than 1. 

Using Eqs. (281 and (29), the recursions can be explic¬ 
itly solved as 


l 



(Oi+ii = (Ooin^w, 

k—0 

N 

(30) 


\Oi) = JJ t7fc,fc+i |Ojv+i) 

k=i 

(31) 

where 

Ui,i+ 1 

= ToiOi+i + ToiU (I - T IiI .) 1 T Ii o i+1 ■ 

(32) 


where 



3 -1 


-OiOj 

= JJ £4,k+li 

(40) 


k=i 


-Oil, 

= =■().(> T 0 1 (I - T/^.) -1 , 

(41) 

5 ho, 

= (I-Ti i i i )- 1 T I . 0 . +1 E 0i+1 o i > 

(42) 

[I] 

= (I - Tf./.) _1 r/ i o i+1 So, +1 / J .. 

(43) 


However, Eq. (39) is not used in practice for the reason 
described below(Section HID). This expression should 
be considered as an artifact of the present approximation 
on the one-dimensional lattice system. In fact, under the 
mean-field approximation, one should have n (a, b) = 
ns{a)ns'(b), but this does not hold for Eq. (39). 


F. Thermodynamic functions 


Finally, the total partition function is obtained as 

N 

s = (Ool Uk,k+l\ON+l) • (33) 

fc=0 


Several “thermodynamic functions” are defined for 
quantifying the stability of the system under perturba¬ 
tions. First, the free energy function 

Q = -T In 2 (44) 


E. Expected densities 


Let us now compute the expected densities. From the 
definition of the partition function (Eq. 151, the following 
equalities hold for single-site and bonded pair densities: 


should be regarded as a grand potential because align¬ 
ments of varying lengths are considered in the ensemble. 
This free energy is a measure of the likelihood of align¬ 
ments expressed in terms of the number densities. By 
rearranging Eq. ( |14[ ) and averaging over all alignments, 
the free energy can be decomposed as 


T ain5 
d/j, s (a) 
d In 5 
dJss'(a,b ) 


n s {a), 

n SS'( a , &)• 


(34) 

(35) 


By explicitly calculating the left-hand sides of these equa¬ 
tions using Eq. (33), we have, for S = O^Ii and 
= OiA 


'i-\- 1 1 -Li 5 


n s (a) = 

(%> (a\S) 

(36) 

:>(a,b) = 

<%> (a\T ss ,\b) {b\S f ) 

(37) 


It is readily proved that these expressions satisfy the rela¬ 
tions between bonded pair and single-site densities (Eqs. 

I”©- 

It is also possible to derive an analytical expression for 
the expected non-bonded pair densities from 

T2 a^aWs. W) = nfs ’ (“-V< b >- < 38 > 

That is, 


fl = U — TS — G (45) 

where U, S and G are the internal energy, entropy and 
Gibbs energy of the system. The internal energy of the 
system is given as 


U = U b + U nb (46) 

where U b and U nb are bonded and non-bonded energies, 
respectively, defined (under the mean-field approxima¬ 
tion) by 


b.p. 

Ub = ~ E J ss'(a,b)n b ss ,(a,b), (47) 

(5,5') a,b 

Unb = rj EE K s {a)[n s {a) - n s (a)]. (48) 

S a 


These correspond to the first two terms on the right-hand 
side of Eq. (161. The internal energy represents the mean 
“direct” interactions (bonded and non-bonded) between 
sites. The Gibbs energy is defined as 


„»», M) = (Sl°> (<■!=«■! W') (39) 


G = '^ns{a)n s (a), (49) 

S',a 
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and this quantity represents the work exerted by the 
chemical potential to maintain the single-site densities. 
Finally, the entropy is given as 


S = (fi- U + G)/T 


(50) 


which is equivalent to the entropy in Eq. (12 1 and thus 
is a measure of randomness of the alignments. 

The temperature T is set to 1 and the chemical poten¬ 
tials are set to 0 for all S £ S, a £ As when the parame¬ 
ters J (and I\) are determined. This state is referred to 
as the reference state in the following. 


G. Gauge fixing 


The relations among the densities (Eqs. [4|-[8]) indicate 
that not all the parameters, /Lt, J, and K. are indepen¬ 
dent. When determining or changing the model param¬ 
eters, we may therefore fix some of them to arbitrary 
values without losing generality. From the normalization 
condition (Eq. [4]) of core sites, it is always possible to set 

Mo,R= 0 (51) 

for all the sites O; £ O (“—” stands for the deletion). 
From this and the relations Eqs. § and @, it is always 
possible to set 


JOiO i+ 1 (—, —) = 0 . 


(52) 


Although there are other degrees of freedom that can be 
also fixed, they are not relevant to the present study so 
I will not fix them. 

Furthermore, at the reference state, I set all ns( a ) to 
zero. This is possible because any values of ns'(b) may 
be absorbed into Jss 1 (a, b) when determining the param¬ 
eters (c.f., Eq. 181. Following the convention of Morcos 
et al. El, I also set 

K 0i 0j ( ~,b) = K 0i0j (a, -) = 0 , (53) 

for all a £ Ao and b £ Ao ■ 


III. MATERIALS AND METHODS 

A. Data preparation and determining lattice 
structure 

I have downloaded the MSA’s and profile HMM’s 
for the globin (PF00042) and (immunoglobulin) V-set 
(PF07686) domains from the Pfarn database (version 
28) [2J3]. For the globin domain, the full alignment of 
17,947 amino acid sequences were used. For the V-set 
domain, the full alignment of of 23,976 sequences was 
used. In addition, I have downloaded 17 families from 
the top 20 largest Pfarn families with the model length 
of less than 300 sites. For these 17 families, the repre¬ 
sentative set of alignments (with 75% sequence identity 
cutoff) were used due to the large size of the alignments. 


In the present study, the lattice structure of a MSA 
was derived from the corresponding Pfarn model. That 
is, each core site corresponds to a profile HMM match 
state, and each insert site to a profile HMM insert state. 


B. Observed densities 


The simplest way to estimate the single-site, bonded 
and non-bonded pair densities from a MSA of M se¬ 
quences is to approximate P(X) = 1/M for all the M 
sequences. In practice, I used pseudo-counts as well as 
sequence weights as in Morcos et al. m to improve the 
robustness of the estimates. Let there be M aligned se¬ 
quences, {X t } t= i i ... i M, in a given MSA and suppose the 
structure of the lattice system has been set. The observed 
densities are defined as follows: 


n s (a) 


n ss' (a, b) 


n ss' («, b ) 


_7 + y 

qs y x m t J ’ 

7 | y' Rg^a^lX*) 

2 qsQs' m t 

7 _j_ Agg>(a, &|X*) 

qsqs' m t 


(54) 

(55) 

(56) 


where S £ S , qs = |-As|, 7 is the pseudo-count, m t is the 
number of sequences in the MSA that are highly homol¬ 
ogous (> 80% sequence identity) to the sequence f, and 
C = 1/(7 + Yjt 1 / m t ) with 7 = 0.1 J2 t l/ m t- Note that 
these estimated densities satisfy the relations analogous 
to Eqs. Q Q. 


C. Determining the J matrices 


As mentioned above, the temperature is set to unity 
(T = 1) in the process of parameter determination. To 
determine J, Eq. (37) is rearranged to have 


Jss'(a,b) = log 


r nR(a, b)l 
(■S\a ) (b\S >)J 


(57) 


where it is assumed ns'(b) = 0 and Kg'(b) = 0 for all 
S' £ S and b £ As< (see Section IIG I. Setting Kg' (b) = 0 


is possible because the expected number densities are 
set to the observed values (see Eq. [0]). By replacing 
n b SS f( a ,b) with the observed value n b s ^7(a,b), one can it¬ 
eratively update the values of J and compute the parti¬ 
tion function until this equation actually holds. In prac¬ 
tice, a relaxation parameter is introduced a to improve 
the stability of convergence. Thus, from the i/-th step 
of iteration, the next updated value is obtained by the 
following scheme. 


J'ss'( a ’ b ) = log 


r n b ss ,(a,b) 3M 


(58) 


.(SRa) (b\S'(A) 

47 1] M = (1 — oi)Jgg, (a, b) + aJ'ss' ( a , b). (59) 























7 



which is a self-consistent equation for A;. Thus, first A i is 
set to a sufficiently large value and compute the partition 
function and expected densities. Then, A^ is updated by 
Eq. (62), and by using the updated value of Aj, we again 


compute the partition function and expected densities. 
This process is repeated until the value of \ converges. 
After the convergence of Aj for all i, Jj.j. is updated 
as in Eq. (37) without including Aj. In this way, the 


contribution of A, is incorporated into the updated value 
of ./ j, q • and A,; will eventually converge to 1, and hence 
may be omitted in later calculations. 

The overall procedure for determining the J matrix 
is shown in Fig. [3] In this procedure, the given data 
are the observed densities and initial values for J and 
A j. After the partition function and expected densities 
are computed, A, is iteratively updated. After has 
converged, J is updated. Convergence is checked based 
on the difference of the expected bonded pair densities 
from their observed values: when the root mean squeare 
between the two densities is less than 10” 13 , the iteration 
is stopped. 


D. Determining the K matrix 

In this study, only those between core sites are taken 
into account for non-bonded interactions. Including non- 
bonded interactions with insert sites is numerically unsta¬ 
ble because the spectral radius of Tj, j, may easily exceed 
1. Noting the gauge fixing (Eq. 53), we first determine 


FIG. 3: Flow chart for determining the J matrix parameters. 


KoiOj (a, b) viewed as a 207V x 207V matrix (consisting of 
TV x TV blocks of 20 x 20 submatrices) by discarding the 
rows and columns including deletion. Then, by fixing the 
values of A', we determine the J matrices. 

Let the observed covariance matrix of single-site counts 
be C: 


I found the values a = 0.1 ~ 0.3 were effective. 

Determining Ji i i i necessitates a special treatment due 
to the requirement that the spectral radius of the transfer 
matrix Tj i j i must be less than 1 (see Eq. 28). In order 


to force I — Ti i i i to be invertible, a parameter A; > 0 is 


introduced such that \\TijJ\iW < 1. Then Eq. (37) for 
S = S' = Ii becomes 


nVAaM = 


(Ajq) (a\Tj i i i \b) (b\I x ) 
SA t 


Let us define the “loop length” k as 


'<= £ 

a,b£Ai- 


n 


Iih 


(■ a,b) 


(60) 


(61) 


and denote its observed counterpart by /,. By imposing 
k = k we have 


A,; = 


(h\ T uu |A) 
EL 


(62) 


CotOjia^b) = n,Q. 0 . (a, b) - n 0l {a)n 0j (b). (63) 

In a similar manner as in Morcos et al. m, one could ap¬ 
ply the Plefka expansion mmm to the grand poten¬ 
tial (Eq. [Tij ) with K = 0 as the reference state. However, 
I found that thus obtained K made the system unstable 
under very weak perturbations. This behavior is perhaps 
due to the incompatibility of the mean-field approxima¬ 
tion with the one-dimensional system (see the remark at 
the end of Section HE). In order to cope with this prob¬ 


lem, I employ the following Gaussian (harmonic) approx¬ 
imation. By assuming the single-site densities are Gaus¬ 
sian random variables yielding the observed covariance, 
the non-bonded coupling is given as 


k = -cr 


(64) 


which is identical to that derived by Morcos et al. EH us¬ 
ing the Plefka expansion, except for the diagonal blocks 
(i.e., KoiOi )■ Unlike their case (where the diagonal 
blocks are defined to be zero), I use the expression for 





























K as in Eq. (64) including the diagonal blocks. The sys¬ 
tem was again found to be unstable when the diagonal 
blocks (and those for bonded pairs) of K were set to zero. 
This approximation makes the K matrix negative semi- 
definite so that the observed single-site densities are the 
most stable ones and there are no other optima as far as 
non-bonded pairs are concerned. 


E. Self-consistent solutions with fixed parameters 


To obtain a self-consistent solution for the recursion 
equation (Eqs. [23 261 with a given set of parameters fj,, 
J and K , we first set the mean-field Kg{a) = 0 for all 
S and a. Then compute the partition function and th e 
expected densities ns (a) and update Kg(a) by Eq. (17). 
This process is repeated until convergence. In practice, 
however, I do not use this self-consistent solution (see 
below). 


F. Self-consistent solutions with fixed sequence 
length 

Note that our partition function is that of a grand 
canonical ensemble so the total number of particles 
(residues) can vary. In practice, however, it is prefer¬ 
able to fix the sequence length for comparing different 
conditions to be meaningful. This can be achieved by 
adjusting the chemical potentials. First, let us define the 
sequence length as the number of particles in the system: 



20 

L = Y J2 n s(a)- 

SGSa =1 


FIG. 4: Flow chart for obtaining the self-consistent solution 
(65) with fixed sequence length (and fixed single-site densities). 


Note that the deletion is not included here (i.e., a = 21 
when S = Oi). Let L denote the target sequence length 
(a constant). At every step of self-consistent calculation, 
update the chemical potential of each residue (except for 
deletion) by 


subtracting this value from those of other residue types 
of the same site. When the sequence length is to be fixed 
as well, both Eqs. ([66|) and (|67|) are applied. 


/i s (a) = fis(a) + e(L - L) (66) 

where e is a small positive constant (e ss 0.001). 


G. Self-consistent solutions with fixed single-site 
densities 

In virtual alanine scanning experiments, the single-site 
densities of particular sites is specified. Given densities 
ns (a) for all a € As for a particular site S can be spec¬ 
ified by adjusting the chemical potentials at every itera¬ 
tion of the self-consistent calculations: 

Ms(a) = A*s(«) + e'[ns{a) - ns {a)] (67) 

where e' is a positive constant (e' ~ 10). For the case 
of core sites, it is always possible to set //$(—) = 0 by 


H. Measures of site conservation and difference 


A measure of site conservation is the site entropy [23] 
defined by 


H 0i = - Y n °i ( a ) ln n Oi ( a ) (68) 

a 


for the reference state. The more well-conserved a site, 
the lower the value of the site entropy. The difference 
between the reference state and a perturbed state is mea¬ 
sured by the Kullback-Leibler divergence [23] : 


D 0i = Y, n °i (°) ln 

a 


no i (a ) 
no, ( a ) ’ 


( 69 ) 
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and the total divergence is defined by 

N 

d = J2 d °>- ( 7 °) 

i—1 


IV. RESULTS 


I now study the behavior of the lattice gas model of 
multiple sequence alignment by varying temperature or 
by “mutating” a site. I mostly focus on the effect of non- 
bonded interactions in the following. For this purpose, I 
compare the system including both the bonded and non- 
bonded interactions (referred to as the “ J+K ” system in 
the following) with that including only the bonded inter¬ 
actions (the “J-only” system). The calculations for the 
J-only system were performed by simply discarding the 
mean-field, which is justified due to the present definition 
of the mean-field (Eq. |l7| ). 

All the calculations in the following are based on the 
“fixed-length” solution, and the sequence length (Eq. 65) 
was constrained to that of the reference state. 


A. Temperature scanning 

Note that the present model does not exhibit phase 
transition due to the Gaussian approximation of the non- 
bonded pair interactions. That is, the K matrix is neg¬ 
ative semi-definite so that there exists one and only one 
minimum for the non-bonded interactions (i.e., at the 
observed single-site densities). Nevertheless, solving the 
self-consistent equation with varying temperatures helps 
to understand the behaviors of interactions. At high tem¬ 
peratures, all the interactions are effectively weakened. 
This can be regarded as an idealization of uniform ran¬ 
dom mutations along the protein sequences of the given 
family. By observing the residue compositions perturbed 
by increased temperature, we can see which sites are more 
robust under the perturbations. 


then it starts to decrease (Figure [5jA) . Decomposing the 
free energy (Eq. |45| shows that both the internal en¬ 
ergy (Figure [5^3) and entropy (Figure [5p) increase with 
temperature. On the other hand, the Gibbs energy (Eq. 
|49| ) monotonically decreases with increasing temperature 
(Figure §>), indicating that the sequence length tends 
to be longer for higher temperature. This can be un¬ 
derstood from the definition of the transfer matrix Tj.j.. 
Since \\Tjj. || < 1 is required, Jjj, {a,b) < 0 holds for all 
a,b £ Aj- ( Ii £ I) so the increased temperature poten¬ 
tially allows a larger number of residues to reside at insert 
sites. In order to fix the sequence length, the chemical 
potential must be negative, and hence the negative Gibbs 
energy. 

The behaviors of the J + K and J-only systems appear 
similar regarding the free energy, internal energy, entropy 
and Gibbs energy. To see the effect of non-bonded inter¬ 
actions more closely, the internal energy was decomposed 
into bonded interactions and non-bonded interactions for 
the J+A" system (Figure]^). It appears that the increase 
in non-bonded energy is more than an order of magnitude 
smaller (Figure 5fe, blue line) compared to that of bonded 
energy (Figure 5E, magenta line). Furthermore, the di¬ 
vergence (difference of residue distributions from the ref¬ 
erence state) shows a relatively large difference between 
the J + K and J-only systems (Figure [5^). Thus, the 
non-bonded interactions are very stable under increased 
temperatures, and they greatly stabilize the residue com¬ 
position. 

A closer examination of each site (at T = 1.2) shows 
that the magnitude of the divergence of the J-only sys¬ 
tem is about three times as large as that of the J + K 
system (Figure [6]). The broad peaks of the divergence 
roughly correspond to regions of a-helices. Furthermore, 
with non-bonded interactions, finer peaks match the pe¬ 
riodicity of the helices (3 to 4 residues) whereas such pe¬ 
riodicity is not observed with the J-only system. Thus, 
non-bonded interactions seem not only to stabilize the 
residue composition, but to make the composition more 
specific to the structure of the domain. 


1. Globin domain 


2. V-set domain 


The globin domains are found in a wide variety of 
organisms ranging from bacteria to higher eukaryotes. 
Two of the most famous family members are myoglobins 
and hemoglobins both of which bind the heme prosthetic 
group. Structurally, globins belong to the class of all-a 
proteins, The lattice gas model of the globin domain con¬ 
sisted of 110 core sites (excluding the termini) and 111 
insert sites. 

The self-consistent equation was solved for tempera¬ 
ture ranging from T = 1.0 to T = 1.7. Above the latter 
temperature, the solution could not be obtained stably 
because the spectral radius of some T/,./,. exceeded 1. 

As the temperature increases, the free energy (grand 
potential, Eq. 44) increases up to around T = 1.15 and 


The V-set domains are found in many proteins the rep¬ 
resentative members of which are immunoglobulin vari¬ 
able domains. The lattice gas model of this domain con¬ 
sists of 114 core sites (excluding the termini) and 115 
insert sites. Structurally, they belong to the all-/3 class 
having a /3-sandwich structure. 

The same procedures were applied to the V-set do¬ 
main as the globin domain. In this case, however, self- 
consistent solutions could be obtained only for temper¬ 
atures T < 1.25. This may be due to a long inser¬ 
tion allowed at the insert site Ig (average length of 23.5 
residues). Other than this limitation, the results were 
found to be qualitatively similar to the case of globins 
(Figure[7]4-D). However, the free energy decrease is more 
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FIG. 5: Temperature scanning of the globin domain. (A) Free energy difference AQ. from the reference state (T = 1). (B) 
Internal energy difference A U. (C) Entropy difference AS. (D) Gibbs energy difference AG. (E) Decomposition of internal 
energy difference into bonded and non-bonded energy differences. The value of non-bonded energy difference (blue line) is 
multiplied by 10. (F) Total divergence of the core site compositions from the reference state (c.f., Eq. 701. 


pronounced for the J + K system, compared to the case 
of the globin. Again, while the increase in temperature 
hardly changes the non-bonded energy (Figure [7)E), the 
difference of the total divergence between the J + K and 
J-only systems is significant. 

A close examination of individual sites at T = 1.2 


also indicates that inclusion of non-bonded interac¬ 
tions greatly suppresses the divergence, and broad peaks 
roughly correspond to secondary structure elements (in 
this case, /3-strands). With the non-bonded interactions, 
finer peaks appear to match with the periodicity of /?- 
strands (2 residues). Therefore, the conclusion drawn 













11 



Site 


FIG. 6: Divergence of core sites of the globin domain at T = 1.2 (c.f., Eq. |69| ). Gray bars indicate sites annotated as helices 
(a-helix, “H” or 3io-helix, “G”) according to the Pfam model annotation (PF00042). The values for the J + I\ system are 
multiplied by 3. 


for the globin domain applies also to the V-set domain. 
That is, the non-bonded interactions act to stabilize the 
residue composition as well as to make composition more 
specific to the structure of the domain. 


B. Alanine scanning 


As opposed to global perturbations such as increased 
temperature, local perturbations helps us to examine the 
contribution of individual sites. Local perturbations can 
be imposed by biasing the residue composition at a site 
of interest. In this subsection, the composition of a par¬ 
ticular core site was biased in such a way that single¬ 
site density was set to 0.95 for alanine and to 0.0025 for 
all other residue types (including the “deletion” residue 
type). This residue composition can be achieved by ad¬ 
justing the chemical potential ^oAo)- When the site Oi 
is constrained in this way, the corresponding equilibrium 
state is referred to as the “A; mutant” in the following. 


1. Globin domain 

Comparing the free energy difference between the 
J + K and J-only systems, it is immediately noticed that 
the ranges of A12 are very different between the two; the 
former being an order of magnitude larger than the lat¬ 
ter. While a large number of alanine mutants for both 
the J + K and J-only systems (82 and 101, respectively, 
out of 110) exhibit Af2 < 0 (i.e., favorable mutants), the 
former (J + K) shows a larger number of unfavorable 
(AO > 0) alanine mutants. Apart from the absolute 
values, the two systems appear to be correlated except 
for the region from the site 40 to 50 where secondary 
structures are sparse (c.f., Figure [6]). In addition, they 
seem to be negatively correlated with site entropy (Fig¬ 
ure [ToJA) : Highly conserved sites tend to have high AO 
values (correlation coefficients, CC, were -0.60 and -0.57 
for the J + K and J-only systems, respectively). Thus, 
despite the great difference in magnitudes, the J + K sys¬ 
tem and J-only system appear to be similar in terms of 
free energy difference. Behind this apparent similarity, 
however, exist different mechanisms, as we shall see in 
the following. 

While internal energy difference, A U, also shows a sim¬ 
ilar correlation as AO (Figure|9j3), entropy difference ex- 
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Temperature 


Temperature 


FIG. 7: Temperature scanning of the V-set domain. (A) Free energy difference AS! from the reference state (T = 1). (B) 
Internal energy difference AU. (C) Entropy difference AS. (D) Gibbs energy difference AG. (E) Decomposition of internal 
energy difference into bonded and non-bonded energy differences. The value of non-bonded energy difference (blue line) is 
multiplied by 10. (F) Total divergence of the core site compositions from the reference state (c.f., Eq. 701. 


hibits different, somewhat opposite, trends (Figure [9p). 
In fact, the relations between the internal energy and en¬ 
tropy are completely different between the J + K and 
J-only systems (Figure [l0f3). While AU and AS are lin¬ 
early and positively correlated (CC = 0.99) for the J-only 
system, they relation is more complicated for the J + K 


system: a positive correlation for AU < 20 (CC=0.65) 
and a negative correlation for AU > 30 (CC=-0.69). 
The region AU < 20 corresponds to that spanned by 
the J-only system, and therefore is considered to be the 
region where local (bonded) interactions are dominant 
in AU. This in turn indicates that a large increase in 
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Site 


FIG. 8: Divergence of core sites of the V-set domain at T = 1.2 (c.f., Eq. |69| ). Gray bars indicate sites annotated as extended 
strand, “E” according to the Pfam model annotation (PF07686). The values for the J + K system are multiplied by 3. 


nonlocal (non-bonded) interactions greatly restricts the 
residue composition throughout the globin domain. In 
fact, unlike the case for temperature scanning (Figure 
[5]), the perturbation by a point mutation induces a large 
increase in non-bonded energy that is comparable with 
that of bonded energy in the J + K system (Figure [9^3). 

The Gibbs energy difference, AG, reveals a sharp con¬ 
trast between the two systems (Figures [9p and [Top). 
The Gibbs energy differences of the J + K system are 
clustered below AG < 50, but has a long tail towards 
higher values (skewness was 1.1). On the other hand, 
those for the J-only system are more or less symmetri¬ 
cally distributed around AG = 5 (skewness was -1.4). 
The correlation between AG and site entropy is evident 
for the J + K system (CC = -0.71), but is nearly absent 
for the J-only system (CC = -0.18) (Figure [Top) . 

The total divergence shows a trend similar to the Gibbs 
energy difference in that its values are clustered at lower 
values and has a long tail towards higher values for the 
J+K system, and that such is not the case for the J-only 
system (Figure [9jT). Although in the both systems the to¬ 
tal divergence is well correlated with site entropy, the cor¬ 
relation is higher for the J + A' system (CC = -0.78) than 
for the J-only system (CC= -0.71) (Figure |l0p). In the 
J-only system, each mutation perturbs the residue com¬ 
positions only locally around the mutated site, whereas 


in the J + K system, a mutation at one site perturbs 
many sites across the the entire domain. As a result, 
the contrast between the effects of mutations at highly 
conserved sites and less conserved sites is higher for the 
J + K system than for the J-only system. 


In the globin domain, the two most highly conserved 
residues are phenylalanine (Phe) at site 38 (Hoi = 0.67) 
and histidine (His) at site 91 (Ho* = 0.64). The alanine 
mutants at these sites show large differences in Af l (Fig. 
10 4), A U (the two points with the largest AG in Fig. 
10 3) and AG (Fig. (lOp). According to a detailed study 
by Ota et al. [24], these two residue are conserved for dif¬ 
ferent reasons: Phe at site 38 (“CD1” in [53]) is conserved 
for structural stability whereas His at site 91 (“F8”) is 
conserved for the heme-binding function at the cost of 
structural stability. While it is reasonable to observe 
that the A 38 mutant of the structurally conserved Phe 
significantly disturbs the system, the present result sug¬ 
gests that the Agi mutant of the functionally conserved 
His is also maintained by a significant amount of interac¬ 
tions with other sites. This may indicate the importance 
of structural scaffold to maintain protein function. 


























































































14 








FIG. 9: “Alanine scanning” of the globin domain. The horizontal axis indicates the site at which the single-site density of a 
core site was set to 0.95 for alanine, and to 0.0025 for other residue types; the vertical axes indicate associated values (A)-(F), 
with the J + K system on the left axis, and J-only system on the right. (A) Free energy difference of “alanine point mutants” 
from the reference state. (B) Internal energy difference. (C) Entropy difference. (D) Gibbs energy difference. (E) Decomposed 
internal energy difference. (F) Total divergence of core sites (Eq. 701. 


2. V-set domain 


The case for the V-set domain is mostly similar to that 
for the globin domain (Figures 11 and 121. However, 


there are some marked differences to be noted. First, the 


free energy differences Af 1 due to alanine mutations take 
both positive and negative values for the J + K system, 
but only negative values for the J-only system. The pos¬ 
itive values for the former corresponds to relatively well- 
conserved sites, as can be seen in Figure [12]A. In fact, the 
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Site entropy, H 0 . 


D 



Site entropy, H 0i 


FIG. 10: Correlations between various quantities for the globin domain. (A) Site entropy, Hoi , vs. free energy change, 

(left vertical axis for the J + K system, right vertical axis for the J-only system). (B) Internal energy difference, AU, vs. 
entropy difference, AS. (C) Site entropy, Ho it vs. Gibbs energy change, AG (left vertical axis for the J + I\ system, right 
vertical axis for the J-only system). (D) Site entropy, Hoi, vs. total divergence, D (left vertical axis for the J + K system, 
right vertical axis for the J-only system). 


correlation between Af l and site entropy is significantly 
higher for the J + K system (CC = -0.72) than for the J- 
only system (CC = -0.52). Second, while the correlation 
between internal energy and entropy differences is linear 
and positive for the J-only system (CC = 0.96) as was 
the case with the globin, that for the J + K system of the 
V-set domain shows only a negative trend for the entire 
range of AU (CC = -0.92). Third, the contrast of the 
Gibbs energy difference is far more pronounced (Figure 
the skewness was 1.8 for J+K and -0.37 for J-only) 
and its correlation with site entropy is very high for the 
J + K system (CC = -0.80) whereas it is negligible for 
the J-only system (CC = -0.08) (Figure [l2|j). Similarly, 
as for total divergence, the J + K system shows sharper 


contrast (Figure [TT^) and higher correlation with site en¬ 
tropy (CC = -0.80, Figure [l2p) than the J-only system 
(CC = -0.67). 

Thus, compared to the case with the globin, the dif¬ 
ferences between the J + K and J-only systems are 
more pronounced. This may be due to the difference 
in the structures of these domains. The globin domain 
has an all-cc fold in which local interactions in a-helices 
are prominent, whereas the V-set domain has an all-/3 
fold in which nonlocal interactions between /3-strands are 
prominent. This difference may be reflected in the non- 
bonded interactions of the lattice gas model, hence the 
pronounced difference between the J +K and J-only sys¬ 
tems. 
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FIG. 11: “Alanine scanning” of the V-set domain. The horizontal axis indicates the site at which the single-site density of a 
core site was set to 0.95 for alanine, and to 0.0025 for other residue types; the vertical axes indicate associated values (A)-(F), 
with the J + K system on the left axis, and J-only system on the right. (A) Free energy difference of “alanine point mutants” 
from the reference state. (B) Internal energy difference. (C) Entropy difference. (D) Gibbs energy difference. (E) Decomposed 
internal energy difference. (F) Total divergence of core sites (Eq. 701. 


3. Other protein families 

To confirm the observations made above, alanine scan¬ 
ning was performed for 17 Pfam families that are the 
largest in the number of family members and are of model 


length of less than 300 sites. The free energy difference, 
Aftends to have more positive values for the J + K 
system than for the J-only system (Fig. ??, cf. Figs. dh 
andllltl). The skewness (i.e., the standardized third mo- 
mentjof AG(Ai) consistently have positive values for the 
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FIG. 12: Correlations between various quantities for the V-set domain. (A) Site entropy, Hoi, vs. free energy change, 

(left vertical axis for the J + K system, right vertical axis for the J-only system). (B) Internal energy difference, AU, vs. 
entropy difference, AS. (C) Site entropy, Ho it vs. Gibbs energy change, AG (left vertical axis for the J + K system, right 
vertical axis for the J-only system). (D) Site entropy, Hoi, vs. total divergence, D (left vertical axis for the J + K system, 
right vertical axis for the J-only system). 


J+K system whereas it can be either positive or negative 
for the J-only system (Fig. |T3]B). The negative correla¬ 
tion between site entropy and A G(Ai) was also clear for 
the J + K system whereas such was not the case for the 
J-only system (Fig. [13)3). Thus, the trend that the non- 
bonded interaction enhances correlation with sequence 
conservation seem to hold generally. 


V. DISCUSSION 

One of the fundamental assumptions of the present 
lattice gas model is that alignment sites can be classified 
into core sites and insert sites. Although this classifica¬ 


tion may be ambiguous to some extent, once the classi¬ 
fication is made, the lattice structure is uniquely deter¬ 
mined. While the lattice structure reflects the chemical 
structure of polypeptide chains, interactions between the 
lattice sites are not limited to those that are local along 
the chain. The principle of maximum entropy allows the 
model to treat bonded (local) and non-bonded (nonlo¬ 
cal) interactions in a coherent manner. In comparison, 
the profile HMM PQ shares a similar lattice structure as 
the lattice gas model, but it cannot treat nonlocal in¬ 
teractions due to its assumption of the Markov process 
along the lattice structure. On the other hand, the direct- 
coupling analysis (as applied to contact prediction) [IT] , 
which casts a MSA as a Potts model [23, simply ignores 
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FIG. 13: Alanine scanning of 17 Pfam families. (A) Free energy difference (cf. Figs. [9j\ and |ll[ \). (B) Skewness (standardized) 
of AG (cf. Figs. [9j3 and|llp). (C) Correlation coefficient between site entropy and Gibbs energy AG (cf. Figs. 10 S and' 
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insert sites so that it cannot faithfully represent polypep¬ 
tide chains. Threading methods |26| or conditional ran¬ 
dom held models m can combine the polypeptide struc¬ 
ture with nonlocal interactions, but such integration is 
often ad hoc because there are no well-defined rules or 


principles for determining the relative contributions of 
various interactions. It is possible to treat a MSA with¬ 
out classifying its columns into cores and inserts if one 
ignores the possibility of adding new sequences in the 
future. In fact, this approach is adopted by the GREM- 
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LIN method by Balakrishnan et al. |2S] that is based 
on the Markov random fields (the present lattice gas 
model also belongs to this class of statistical models). 
In practice, however, they discarded columns with ex¬ 
cessive gaps. Such a preprocessing seems to be required 
because alignments within an insertion are often mean¬ 
ingless. This does not necessarily mean, however, that 
the existence of the insertion is meaningless. In any case, 
discarding columns of a MSA will lose the information 
about the linear chain structure of protein sequences as 
well as the possibility of adding new sequences without 
changing the core structure of the MSA. The present lat¬ 
tice gas model resolves the shortcomings of these previous 
models as both bonded and non-bonded interactions as 
well as insertions naturally emerge from a single frame¬ 
work. The main tricks here are the classification of core 
and insert sites and the use of residue counts, ns(a|X) 
and n b ss , (a, 6|X), as fundamental variables rather than 
the raw alignment sequences (X). These are especially 
important for treating insert sites where any number of 
residues are allowed to exist. The lattice gas model can 
compute the probability of an entire alignment, and what 
has been conventionally regarded as the probability of 
residue occurrence at sites should be regarded as the ex¬ 
pected number of residues at the sites. 

From a theoretical point of view, the present formula¬ 
tion of the lattice gas model offers an interesting perspec¬ 
tive regarding the interplay between local and nonlocal 
interactions. As can be seen from the relations Eqs. <[5j)— 
([8]), or more precisely, from the analogous relations that 
hold for the number densities, local and nonlocal interac¬ 
tions are not independent of each other, but are related 
via single-site densities. In this sense, local and nonlo¬ 
cal interactions must be consistent with each other ?ilj . 
and the consistency is inherently embedded in a (well- 
curated) MSA. In the conventional formulation of the 
direct-coupling analysis, only the relations correspond¬ 
ing to Eqs. ([7| and (|8j) are present because the chain 
structure is absent. Since the parameters conjugate to 
the single-site densities are external fields (chemical po¬ 
tentials in the present case) which are not intrinsic to 
the system, the relations Eqs. Q and ^ alone do not 
address the consistency between local and nonlocal inter¬ 
actions. 


In this study, I have adopted the Gaussian approxima¬ 
tion for the non-bonded coupling parameters (Eq. 64) 
as well as the mean-field approximation (Eq. 171 for 
computing the partition function. This approach has 
its advantages and disadvantages. The advantages are 
that the parameters are readily obtained and that the 
partition function can be computed analytically and effi¬ 
ciently. These enable us to study the system under var¬ 
ious perturbations relatively easily. A major disadvan¬ 
tage is that it is not possible to determine the K matrix 
self-consistently. I therefore resorted to the Gaussian ap¬ 
proximation by implicitly assuming that each site is in¬ 
dependent of other sites, which is not fully consistent 
with the lattice structure of the system. The reason for 


this inconsistency is likely to be that the assumption for 
the mean-field approximation (i.e., non-bonded interac¬ 
tions are relatively weak; see references [III 2T] ) does not 
actually hold in the present case. Due to this approxi¬ 
mation, the system does not exhibit a phase transition 
that might be induced by increased temperatures or by 
mutations at potentially important sites. In addition, 
the Gaussian approximation required that the diagonal 
blocks of the I< matrix, KoiOiia^b), be used as in Eq. 
(64), otherwise the reference state was found to be unsta¬ 
ble. The diagonal blocks represent self-interactions, and 
hence, are purely site-specific quantities. In this sense, 
they obscure the mechanism by which the interactions 
of each site with other sites induce the residue compo¬ 
sition of that site. Overcoming these problems would 
require the direct maximization of the Lagrangian (Eq. 


131 with respect to the parameters K$s' (a, b) without di¬ 


agonal (and bonded pair) blocks. It is also possible to ap¬ 
ply other approximate methods such as pseudo-likelihood 
maximization mm eh. 


Despite these limitations in the treatment of non- 
bonded interactions, the present results already provided 
some interesting observations regarding the role of non- 
bonded interactions. An increased temperature exerts a 
global and unbiased perturbation on the system. In this 
case, it was found that non-bonded energy did not signif¬ 
icantly change compared to the bonded energy (Figures 
[5jil and HP)- This implies that the residue compositions 
at each site adapt to the perturbation in a cooperative 
manner so that they stay stable. This in turn suggests, 
at least within the limitation of the approximations, that 
a protein family can accommodate a diverse variety of 
amino acid sequences as far as the pattern of correlations 
between sites is conserved. On the other hand, the vir¬ 
tual alanine scanning revealed a more conspicuous effect 
of non-bonded interactions. Alanine mutations at well- 
conserved sites disturbed the system to a greater extent 
as measured by free energy, Gibbs energy and total diver¬ 
gence (Figures |To| and [l2| , and the relation between inter¬ 
nal energy and entropy changes was completely different 
from those of J-only systems. In particular, the observa¬ 
tion that many or most of the free energy changes were 
negative for the J-only system (Figures EK and [hn 
suggests that residue conservation cannot be explained 
without considering nonlocal (non-bonded) effects. 


The interactions in the lattice gas model originate 
solely from the statistics of a MSA. They are therefore 
not directly related to physical interactions. However, 
it has been demonstrated that the K matrix as used in 
this study is a good predictor of physical contacts in na¬ 
tive protein structures pun]. To further confirm this, 
the present results showed that the effect of non-bonded 
(statistical) interactions was more pronounced in the V- 
set domain (an all-/? fold, involving more nonlocal phys¬ 
ical interactions) than in the globin domain (an all-a 
fold, involving less nonlocal interactions). In addition, 
the J-only system showed relatively better correlations 
with conservation for the globin than for the V-set do- 
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main, indicating that the bonded interactions also reflect 
physical local interactions to some extent. This point is 
also supported by the correlation, albeit weak, between 
divergence and secondary structures (Figures [6] and [8]). 
Thus, the lattice gas model provides a means to connect 
the information in amino acid sequence with the under¬ 
lying three-dimensional structure of the domain. This 
connection cannot be addressed directly in conventional 
sequence analysis methods such as the profile HMM. In 
fact, the very existence of long-range correlations indi¬ 
cates that MSA’s cannot be modeled as a purely one¬ 
dimensional system where long-range correlations simply 
cannot exist [32]. Considering this fact, it is surpris¬ 
ing that conventional multiple sequence alignment meth¬ 
ods, inherently based on the one-dimensional system, can 
produce MSA’s with long-range correlations. This may 
be a manifestation of the consistency principle indicated 
above [29] . 

There are a few possible extensions and applications 
of the present lattice gas model. In the present form, 
the model is autonomous in the sense that it does not 
require an input or target sequence for computing var¬ 
ious statistical quantities (once the observed statistical 
quantities are obtained). Nevertheless, it is readily pos¬ 
sible to align the model with a particular amino acid 
sequence to compute a partition function and therefore 
other quantities conditioned on that input sequence. In 


this way, the lattice gas model may be used for detect¬ 
ing remote homologs. The present results (e.g., Figures[ 6 ] 
and [ 8 ]) suggest that inclusion of non-bonded interactions 
would increase the specificity of the alignment. Further¬ 
more, the model can be aligned with a “sequence” of a 
given length with unspecified amino acid residues to com¬ 
pute the partition function that is conditioned on all the 
amino acid sequences of that length. In this way, one 
can enumerate those sequences that are compatible with 
the model. In other words, the model may be used for 
designing optimal sequences for a given protein family. 
Such applications may be pursued in the future to open 
new possibilities in protein sequence analysis. 
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